PT TN: money, measure, telephone, electronic #416

Open
folivoramanh wants to merge 2 commits into NVIDIA:staging/pt-br_tn from folivoramanh:pt_tn_measure_money_telephone_electronic

@folivoramanh
Collaborator

Adds semiotic classes and tests on top of staging/pt-br_tn; includes cardinal fix for X00 + 01–09 and Sparrowhawk script updates.

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unit tests finish successfully before sending the PR?
    1. pytest or (if your machine does not have a GPU) pytest --cpu from the root folder (given you marked your test cases accordingly with @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here?
  • Have you added __init__.py for every folder and subfolder, including the data folder which has .TSV files?
  • Have you followed codeQL results and removed unused variables and imports (report is at the bottom of the PR in github review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

Adds semiotic classes and tests on top of staging/pt-br_tn; includes
cardinal fix for X00 + 01–09 and Sparrowhawk script updates.

Signed-off-by: Mai Anh <palasek182@gmail.com>
@folivoramanh force-pushed the pt_tn_measure_money_telephone_electronic branch from 42445af to 7b8ee51 on April 17, 2026 15:30
https://www.nvidia.com~h t t p s dois pontos barra barra w w w ponto nvidia ponto com
http://site.com.br~h t t p dois pontos barra barra s i t e ponto com ponto br
nvidia.com~nvidia ponto com
@usuario~arroba u s u a r i o
\ No newline at end of file

Let's add a TSV file listing company names and common terms that should not be spelled out letter by letter in this normalization (so they behave like your l5 and l7). Include google and usuario in that file, along with some common companies. Then add different examples for the char-separated case and keep these as whole words.
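A minimal plain-Python sketch of the behavior requested here, not the pynini grammar itself. The whitelist contents, the names `KEEP_TOGETHER_TSV`, `load_terms`, and `verbalize`, and the choice to list domain suffixes such as `com` and `br` in the same file are all assumptions made for illustration; the actual TSV name and entries are up to the PR.

```python
import csv
import io

# Hypothetical whitelist: one term per row, first column only.
# In the PR this would live in a .tsv data file; it is inlined here for the sketch.
KEEP_TOGETHER_TSV = "nvidia\ngoogle\nusuario\ncom\nbr\n"

# Spoken forms for the symbols that appear in the test cases above.
SYMBOLS = {".": "ponto", "@": "arroba", "/": "barra", ":": "dois pontos"}


def load_terms(tsv_text):
    """Read the first column of a TSV into a set of lowercase terms."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0].strip().lower() for row in reader if row and row[0].strip()}


def verbalize(text, keep_together):
    """Verbalize an electronic token; whitelisted words stay whole, others are spelled out."""
    tokens = []
    word = ""
    for ch in text + " ":  # the trailing space flushes the final word
        if ch.isalnum():
            word += ch
            continue
        if word:
            tokens.append(word if word.lower() in keep_together else " ".join(word))
            word = ""
        if ch in SYMBOLS:
            tokens.append(SYMBOLS[ch])
    return " ".join(tokens)


terms = load_terms(KEEP_TOGETHER_TSV)
print(verbalize("nvidia.com", terms))   # nvidia ponto com
print(verbalize("@usuario", terms))     # arroba usuario
print(verbalize("@fulano", terms))      # arroba f u l a n o
```

With this whitelist, `@usuario` stays together while an unlisted handle such as `@fulano` is still char-separated, which is the pair of contrasting examples the comment asks for.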

preserve_order = pynutil.insert(" preserve_order: true")

integer_plus_maj = pynini.union(
graph_integer + insert_space + pynutil.insert(curr_symbol) @ graph_maj_plural,

Should the space next to the currency symbol be optional, so that both forms are normalized?
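One reading of this question is that written inputs such as `R$10` and `R$ 10` should both be accepted. The sketch below illustrates that reading with a plain regular expression rather than the PR's pynini graphs. The symbol table, the `MONEY` pattern, and `parse_money` are hypothetical names introduced here, and the currency names are assumed examples, not the PR's data.

```python
import re

# Assumed symbol table for illustration: symbol -> (singular, plural) major unit.
CURRENCY_NAMES = {"R$": ("real", "reais"), "US$": ("dólar", "dólares")}

# "\s?" makes a single space between the symbol and the amount optional.
MONEY = re.compile(r"(R\$|US\$)\s?(\d+)")


def parse_money(text):
    """Return (amount, unit name) for inputs like 'R$10' or 'R$ 10', else None."""
    match = MONEY.fullmatch(text)
    if match is None:
        return None
    symbol, digits = match.groups()
    amount = int(digits)
    singular, plural = CURRENCY_NAMES[symbol]
    return amount, singular if amount == 1 else plural


print(parse_money("R$10"))   # (10, 'reais')
print(parse_money("R$ 10"))  # (10, 'reais')
print(parse_money("US$ 1"))  # (1, 'dólar')
```

In a pynini grammar the equivalent idea would be to wrap the space-deleting acceptor in an optional closure, so one rule covers both spellings instead of duplicating the union branches.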
